The Fitness-Corrected Block Model, or how to create maximum-entropy data-driven spatial social networks

Models of networks play a major role in explaining and reproducing empirically observed patterns. Suitable models can be used to randomize an observed network while preserving some of its features, or to generate synthetic graphs whose properties may be tuned upon the characteristics of a given population. In the present paper, we introduce the Fitness-Corrected Block Model, an adjustable-density variation of the well-known Degree-Corrected Block Model, and we show that the proposed construction yields a maximum entropy model. When the network is sparse, we derive an analytical expression for the degree distribution of the model that depends on just the constraints and the chosen fitness-distribution. Our model is perfectly suited to define maximum-entropy data-driven spatial social networks, where each block identifies vertices having similar position (e.g., residence) and age, and where the expected block-to-block adjacency matrix can be inferred from the available data. In this case, the sparse-regime approximation coincides with a phenomenological model where the probability of a link binding two individuals is directly proportional to their sociability and to the typical cohesion of their age-groups, whereas it decays as an inverse-power of their geographic distance. We support our analytical findings through simulations of a stylized urban area.

The definition of a suitable data-driven spatial social network model is a widely studied problem in computational social sciences. Various dynamical processes (e.g., the spread of a disease) can be represented on such networks, and the topology of the network has a direct impact on the evolution of the process. Having general models for social interactions, based on available data and capable of reconstructing stylized facts known from the literature, is of utmost importance to avoid reaching conclusions biased by incorrect or ill-defined assumptions. With widely available survey and census data, it is now possible to generate synthetic geo-localized populations, stratified by age and organized into households. However, there is no equally direct way to accurately model interpersonal relationships.
Many real social networks exhibit some form of group mixing, driven by the tendency of individuals to socialize with their peers 1 . This property can be reproduced using the so-called Stochastic Block Model (SBM), which rose to prominence as a way to generate networks with a known community structure 2,3 . In the SBM, the vertex set is partitioned into disjoint blocks and the probability of an edge between two nodes depends on the blocks to which the two nodes belong. In the original formulation of the SBM, all vertices belonging to the same block are indistinguishable, so that the degree distribution within each block tends to be Poisson-like for large graphs 2 . To produce a more realistic network, a few extensions to the model have been proposed in the literature, including the Degree-Corrected Block Model (DCBM) 2 and its maximum-entropy version 4 . These models are, in some sense, the SBM-equivalent of the well-known configuration model. They are based on enforcing both the desired group mixing and a target degree sequence, either exactly or in expectation. If $L_{IJ}$ is the number of links between blocks $I$ and $J$ (or twice that number, if $I = J$) and $\deg_i$ is the degree of node $v_i$, the maximum-entropy DCBM works by imposing that $L_{IJ} = K_{IJ}$ and $\deg_i = k_i$ for suitable constants $K_{IJ}$ and $k_i$. The internal consistency of the model requires $\sum_J K_{IJ} = \sum_{i \in I} k_i$ for all $I$, and the density of the network is fixed equal to $\frac{\sum_{I,J} K_{IJ}}{N(N-1)}$, where $N$ is the size of the network.
In this paper, we define and analyze the Fitness-Corrected Block Model (FCBM), a variation of the DCBM where the network density $p$ is a configuration parameter. In the FCBM the block-level mixing is specified in terms of a matrix of edge-densities, as in the original SBM, whereas a sequence $f$ of vertex intrinsic fitness values 5,6 measures the propensity of each vertex to establish links and can be used to enforce the desired intra-block heterogeneity. In the DCBM the constants $K_{IJ}$ and $k_i$, which constrain, respectively, $L_{IJ}$ and $\deg_i$, need to be known explicitly. The DCBM was in fact conceived as an instrument to randomize an observed graph while preserving some of its features. In the FCBM, instead, $K_{IJ}$ and $k_i$ are determined based on $p$, the edge-density matrix and $f$, making the FCBM a suitable model for generating random graphs whose properties may be tuned to the characteristics of a given population or set of entities.
1 Institute for Applied Computing "Mauro Picone", National Research Council of Italy, Via dei Taurini 19, 00185 Rome, Italy. www.nature.com/scientificreports/
In order to obtain a model that is maximally random, the FCBM is defined following the approach first presented in Ref. 4 for the DCBM, and later clarified and generalized in Ref. 7 . In a nutshell, the approach consists of: (1) the definition of an ensemble of networks, each with the same number of nodes, but with all possible configurations of links; (2) the constrained maximization of the entropy associated with the network ensemble, via the method of Lagrange multipliers. By imposing the conditions $L_{IJ} = K_{IJ}$ and $\deg_i = k_i$, for all $I$, $J$, $i$, the probability per graph in the ensemble factorises in terms of probabilities per link. The numerical values of the Lagrange multipliers have to be calculated by solving the system of nonlinear equations given by the model constraints. Solving this system might become expensive when the network is large and, to the best of our knowledge, no publicly available software library implements it. For the present paper, we implemented a parallel version of the solver for the maximum-entropy DCBM, written in C making use of the Intel MKL scientific library and the OpenMP API, which can also be used to solve our FCBM. The solver is released as open-source software at https://gitlab.com/cranic-group/dcbm_solver.
A known sparse-regime approximation for the edge probability of the DCBM implies that, in the sparse FCBM, the probability of a link binding two individuals $i$ and $j$ is directly proportional to their sociability and to the cohesion of their blocks, i.e., $p_{ij} \propto p f_i f_j \Lambda_{I_i J_j}$ for all $i \in I$ and $j \in J$, where $\Lambda$ denotes the block-wise mixing matrix. Under the sparse-regime approximation, the maximum entropy condition leads to a system of equations that admits a closed-form solution. We make use of this approximate solution to find two closed-form estimates for the degree distribution $p_k$ of the FCBM: one more accurate, the other neater. These two estimates put $p_k$ in direct relation with the fitness distribution $p_f$, showing that the degree distribution of the FCBM, albeit not known a priori, can be essentially controlled through the model's parameters. In particular, if $p_f$ follows a power-law, lognormal or exponential distribution, then the same holds, approximately, for $p_k$.
Among the many possible applications, the FCBM is especially well suited for generating a data-driven social network of geo-referenced and age-stratified individuals. The partition of the population into blocks can be obtained by grouping the individuals having similar position (e.g., residence) and age, whereas the expected block-to-block edge-density matrix can be calibrated based on survey data that quantify the dependence of contact frequencies upon geographic and socio-demographic factors. Finally, the fitness vector $f$ may be drawn from a suitable probability distribution, modelled upon measurable features such as wealth, employment, or mobility. In this context, the sparse FCBM is well approximated by the phenomenological model presented in Refs. 8,9 . To show how the FCBM can be used in practice, and to provide empirical support to our analytical findings, we generate a set of synthetic social networks for a stylized urban population distributed on a disk of radius 2.5 km. We use the SOCRATES 10,11 tool to extract age-based social mixing patterns, and we embrace the widely accepted assumption that an inverse-power-law relation binds the distance between two individuals and the frequency of their social interactions 12,13 . We generate instances of our FCBM using either the exact model or its sparse-regime approximation, varying both the spatial density of the individuals in the territory and the fitness distribution. We show that the empirical degree distribution obtained with all considered configurations is in close agreement with the analytical estimates. The implementation of the FCBM is publicly available as part of the Urban Social Networks (USN) framework at https://gitlab.com/cranic-group/usn.

Contributions and results
In the following, we summarize the main contributions and present the main analytical and experimental results of this paper. For all methodological details, we refer the reader to the "Methods" section.
The Fitness Corrected Block Model. We propose the Fitness-Corrected Block Model (FCBM), a new maximum-entropy model for modular networks, parameterized by the network density $p \in (0, 1)$, the block-wise mixing structure $\Lambda$ (a symmetric matrix such that $\sum_{I,J} \Lambda_{IJ} = 2$) and the vertex-intrinsic fitness $f$ (a vector that controls the tendency of each vertex to establish links with other vertices).
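Formally, let $V$ be a vertex set of size $N$ partitioned into $n$ blocks $\{B_I\}_{I=0}^{n-1}$. The FCBM is defined as the maximum entropy probability distribution $P$, over all networks having vertex set $V$, fulfilling the following two conditions:

$$\langle L_{IJ} \rangle_P = p \binom{N}{2} \Lambda_{IJ} \quad \text{for all } I, J \tag{1}$$

$$\langle \deg_i \rangle_P = p \binom{N}{2} \frac{f_i}{\sum_{u \in I_i} f_u} \sum_J \Lambda_{I_i J} \quad \text{for all } i \tag{2}$$

where $\Lambda$ is the block-wise mixing matrix, with $\sum_{I,J} \Lambda_{IJ} = 2$; $L_{IJ}$ is the number of edges between $B_I$ and $B_J$; $\deg_i$, $f_i$ and $I_i$ are the degree, fitness and pertaining block of vertex $v_i$; $\langle \cdot \rangle_P$ denotes the expected value with respect to $P$.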
The FCBM can be seen as a generalization of the Degree-Corrected Block Model (DCBM). However, unlike the DCBM, the FCBM is consistent for any choice of the configuration parameters, as we show, which makes it suitable for generating random graphs with tunable topological properties.
Leveraging the framework of entropy-based null-models, we prove that the probability $P(G)$ of generating a specific graph $G$ with the FCBM can be factorized as the product of independent edge probabilities, namely $P(G) = \prod_{i<j} p_{ij}^{a_{ij}} (1 - p_{ij})^{1 - a_{ij}}$, where $p_{ij}$ is the probability of an edge between $v_i$ and $v_j$ and $a_{ij} \in \{0, 1\}$ indicates whether that edge is present in $G$. The edge probabilities can be recovered by solving a system of non-linear equations obtained from (1) and (2).
Efficient solver for FCBM/DCBM system of equations. The system of nonlinear equations that must be solved to explicitly calculate the edge probabilities becomes computationally expensive as the size of the network increases. Already with a few thousand nodes, a straightforward implementation may be too slow to be used in practical applications. We implemented a parallel C program that efficiently solves this system, as well as the analogous system arising from the DCBM. The solver follows the Sequential Quadratic Programming approach presented in Ref. 14 , using Newton's method for the Hessian approximation. To the best of our knowledge, this is the first publicly available solver for the DCBM. The source code is publicly released as open-source software at https://gitlab.com/cranic-group/dcbm_solver.
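The factorization into independent edge probabilities can be illustrated with a few lines of Python. This is a minimal sketch, not the released solver: it assumes the usual maximum-entropy form $p_{ij} = x_i x_j y / (1 + x_i x_j y)$ for the edge probability, and the multiplier values `x` and `y` below are hypothetical numbers chosen for illustration rather than the solution of any specific system.

```python
import math
import random

def edge_probability(x_i, x_j, y_ij):
    """Edge probability x_i x_j y / (1 + x_i x_j y), assumed functional form."""
    q = x_i * x_j * y_ij
    return q / (1.0 + q)

def sample_graph(x, y, block_of, rng):
    """Draw every pair i < j independently with its own edge probability."""
    n = len(x)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probability(x[i], x[j], y[block_of[i]][block_of[j]])
            if rng.random() < p:
                edges.append((i, j))
    return edges

def log_probability(edges, x, y, block_of):
    """log P(G) = sum over i < j of a_ij log p_ij + (1 - a_ij) log(1 - p_ij)."""
    present = set(edges)
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probability(x[i], x[j], y[block_of[i]][block_of[j]])
            total += math.log(p) if (i, j) in present else math.log(1.0 - p)
    return total

rng = random.Random(42)
x = [0.5, 1.0, 2.0, 0.8]        # hypothetical vertex multipliers
y = [[0.3, 0.1], [0.1, 0.4]]    # hypothetical block multipliers (symmetric)
block_of = [0, 0, 1, 1]         # block index of each vertex
graph = sample_graph(x, y, block_of, rng)
print(graph, log_probability(graph, x, y, block_of))
```

Because the edges are independent, sampling costs one Bernoulli draw per vertex pair once the multipliers are known; the expensive step is computing the multipliers themselves.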
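Properties of the FCBM. The fitness sequence $f$ guarantees intra-block heterogeneity. In fact, (2) can be rewritten as

$$\langle \deg_i \rangle_P = \frac{f_i}{\sum_{u \in I_i} f_u} \langle \deg_{I_i} \rangle_P$$

where $\langle \deg_I \rangle_P$ is the expected total degree of block $B_I$, i.e., the total number of edges incident to $B_I$.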
When the network is sparse, we show that the system from which all $p_{ij}$'s must be derived admits a closed-form approximate solution. The sparse-regime approximation allows us to estimate the expected degree distribution of the sampled graph based on the fitness distribution $p_f$. We derive two estimates for the degree distribution $p_k$ of the network. The first estimate, (3), is expressed in terms of $p_f$ and of $\mu_I$, the expected average degree of the vertices in $B_I$. The second estimate, (4), less accurate but easier to interpret and use in practice, is expressed in terms of $p_f$ and of $\mu$, the average degree of the network. The obtained analytical expressions for $p_k$ have a very desirable property: for many choices of $p_f$, including power-law, lognormal or exponential distributions, $p_k$ essentially has the same "shape" as $p_f$.
Data-driven FCBM for spatial social networks. We propose an application for our FCBM as a maximum-entropy model for data-driven spatial social networks. In particular, we envision its application to generate geographic networks informed by census data, contact surveys, and geospatial data. Indeed, the phenomenological model described at the end of this section has already been employed to develop a realistic social network at the urban scale 9 and to study the spread of an epidemic process on it 15-17 .
Let the vertex set $V$ describe an age-stratified population of $N$ individuals living in a territory tessellated into square tiles of side $l$. Each $v_i$ is thus characterized by two data-driven discrete attributes: its tile of residence $t_i \in T$, that is, the discretized position of $v_i$ in the territory, and its age-group $g_i \in \Gamma$. These two attributes induce a partition of the population into $n = |T| \cdot |\Gamma|$ blocks $\{B_I\}_{I=0}^{n-1}$, with $v_i \in B_I = (t_I, g_I)$ if and only if $t_i = t_I$ and $g_i = g_I$. To define a data-driven block-wise mixing matrix $\Lambda$, we observe that:
• Social mixing patterns, derived from heterogeneous data sources such as surveys, cell phones or wearable sensors [18][19][20] , can be used to reconstruct a data-driven age-based mixing matrix $S$ 9 , whose element $s_{IJ}$ measures the tendency of age groups $g_I$ and $g_J$ to socialize with each other.
• The frequency of social relations between individuals living in $t_I$ and $t_J$ is generally assumed to decay as $d_{IJ}^{-\beta}$, where $d_{IJ}$ is the normalized (geographic or euclidean) distance between tiles $t_I$ and $t_J$, and the exponent $\beta > 0$ depends on the type of relation and on the extension of the territory 12,13 .
This leads to an edge-density $\Lambda_{IJ}$ that is proportional to $s_{IJ}$ and decays as $d_{IJ}^{-\beta}$ (5). Finally, we extract the fitness vector $f$ from a suitable distribution $p_f$, set the density parameter $p \in (0, 1)$ and, for all pairs $i, j$, compute the edge probability $p_{ij}$ as prescribed by the FCBM. As a result, the graph sampling probability $P$ guarantees that: (1) the expected number of links between $B_I$ and $B_J$ is proportional to $s_{IJ}$ and decays as $d_{IJ}^{-\beta}$; (2) the expected degree of $v_i$ is proportional to $f_i$ and to the expected total degree of block $I_i$.
In this case, the sparse-regime approximation yields expression (6), which defines a phenomenological model, analogous to the one presented in Ref. 9 , where the probability of two individuals being connected is proportional to their sociability and to the cohesion of their age-groups, while decaying as a power of their distance.
Clearly, the estimates obtained in (3) and (4) for the degree distribution remain valid in the data-driven model. If the network is sufficiently sparse and the population of all tiles/groups is sufficiently large, the degree distribution of the sampled graph $G$ is controlled by the available data and by the chosen fitness distribution $p_f$.
Experimental analysis. We implemented both the exact FCBM and its sparse-regime approximation. The code is released, as open-source software, as part of the Urban Social Networks (USN) framework at https://gitlab.com/cranic-group/usn. We used our data-driven FCBM to generate instances of a spatial social network for a stylized city of 10K inhabitants living in a disk of radius 2.5 km. We set $p$ so that the average degree of the network is $\mu = 25$, we set $\beta = 1$, we used age-density data for Italy as released by the Italian National Institute of Statistics (ISTAT), and we extracted the matrix $S$ from data released by the POLYMOD project 21 . We considered three possible spatial densities (uniform, and increasing or decreasing with the distance from the disk's center) and three possible fitness distributions (Pareto, lognormal and exponential). For all nine combinations, we generated 10 independent graph instances.
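The ingredients of this setup can be sketched in Python as follows. This is only an illustration of the uniform-density case: the function names and the shape and scale parameters of the three fitness distributions are hypothetical choices made for this sketch, and the density is derived from the identity $\mu = p(N-1)$, which holds for any simple graph with density $p$.

```python
import math
import random

def sample_positions(n, radius_km, rng):
    """Uniform points in a disk; r = R * sqrt(u) keeps the area density flat."""
    points = []
    for _ in range(n):
        r = radius_km * math.sqrt(rng.random())
        angle = 2.0 * math.pi * rng.random()
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

def sample_fitness(n, kind, rng):
    """Draw n fitness values; all distribution parameters are illustrative."""
    if kind == "pareto":
        return [rng.paretovariate(3.0) for _ in range(n)]
    if kind == "lognormal":
        return [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
    if kind == "exponential":
        return [rng.expovariate(1.0) for _ in range(n)]
    raise ValueError(f"unknown fitness distribution: {kind}")

rng = random.Random(0)
N = 10_000                      # inhabitants of the stylized city
positions = sample_positions(N, 2.5, rng)
fitness = sample_fitness(N, "lognormal", rng)

# The average degree is 2L / N = p * (N - 1), so mu = 25 fixes the density.
p = 25 / (N - 1)
```

Non-uniform spatial densities would replace the radial draw in `sample_positions` with a different radial law; that variation is left out of the sketch.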
In Fig. 1 we show the empirical degree distribution, averaged over the 10 instances of the FCBM for each configuration, considering both the exact model and the sparse-regime approximation. We also show the two estimates (3) and (4). The plots confirm that the approximation is sound and that the two estimates can be safely used in practice, at least for sparse networks, with (3) working especially well in all cases, except possibly for very small degrees.

Discussion
Developing realistic network models for social interactions is of paramount importance to understand the underlying mechanisms that lead to the observed features of such networks and to study all dynamical processes, such as disease spreading, that are strongly influenced by the network topology. Extreme care must be taken to avoid that any bias is unintentionally injected in the model. For this reason, maximum-entropy models are extensively used in network analysis, either as null models, or to generate synthetic networks that have specific characteristics, but are otherwise maximally random.
In this paper, we introduced the maximum-entropy Fitness-Corrected Block Model (FCBM), an adjustable-density model for modular and heterogeneous networks, whose block-wise mixing pattern is known in expectation. The model has a general and flexible formulation, but it was designed with a key application in mind: providing a working tool for building synthetic social networks informed by census, survey and geospatial data. Publicly available spatial density and demographic data are in fact necessarily discrete, thus inducing a partition of the population into blocks of agents who belong to the same age-group and live in the same area. The expected block-to-block edge-density can be estimated based on empirical findings that quantify the dependence of contact frequencies upon geographic and demographic features. However, both the density and the degree of heterogeneity of real-world social networks depend on the considered type of interpersonal relations and are rarely explicitly known beforehand. Unlike other block models in the literature, the FCBM makes both these network features adjustable. In particular, the desired intra-block heterogeneity can be enforced through a vertex-intrinsic social fitness, possibly modelled upon observable population-level variables, such as wealth, employment or mobility.
We implemented the FCBM and made it publicly available as open-source software. The released software includes a parallel code that speeds up the computation of the most expensive part of the required maximization procedure, a step of the algorithm that is also needed in the well-known Degree-Corrected Block Model. We tested our implementation of the FCBM by reconstructing instances of a social network connecting the individuals of a stylized city of 10K inhabitants. The experiments allowed us to verify that the exact model and its efficient sparse-regime approximation yield networks with almost identical degree distributions. We also showed and experimentally verified that, in the sparse regime, the expected degree distribution of the output network can be estimated by two closed-form expressions. Thanks to these two estimates, the shape of the degree distribution can be predicted based on the chosen fitness distribution.
In the near future, we plan to use the FCBM (and, possibly, a temporal extension of the model) to simulate dynamical processes in real-world territories and understand how socio-demographic features and social habits affect the outcomes of these processes. To this end, we will work towards gaining a better understanding of how the topological properties of the FCBM depend on the configuration parameters, with special attention paid to a set of network properties, such as the clustering coefficient and the excess degree distribution, that can be used to study percolation and diffusion dynamics on the network.

Related work
Under the pressure of the COVID-19 pandemic, several simulation frameworks have been developed to provide realistic descriptions of the disease spread process on different spatial and temporal scales, from a single building to complex urban areas up to a global scale [22][23][24][25][26][27][28][29] . World-scale meta-population models and agent-based systems describing small and large areas can be informed by a variety of data sources. Census data and/or grid based population counts can be integrated to reconstruct populations that are statistically indistinguishable from real ones, including age, geographic distribution, education and wealth. The number and intensity of contacts in specific settings, such as workplaces, schools, or households, can be collected through surveys, questionnaires, diaries, and, if possible, supplemented with data obtained from digital technologies such as cell phones or wearable sensors [18][19][20] . These data are then used to reconstruct contact matrices, individual or group schedules, and are widely used to reproduce synthetic interactions 30,31 .
In comparison, generative network models that can reproduce the characteristics of real-world social networks by incorporating information from data have received much less attention. The task of inferring a realistic distribution of social ties is quite challenging, since friendship ties can only be measured for a small subset of real-world networks, and the mechanism underlying tie formation, while thoroughly studied, is still far from being fully understood. Ideally, the models should reproduce the main features of real-world social networks, well summarized in Ref. 32 . These networks show a heavy-tailed (e.g., lognormal) degree distribution, often with a finite cutoff in agreement with Dunbar's number. The transitivity of the networks is high, compared to a random graph model, as a consequence of the well-established principle that "friends of my friends are my friends". Moreover, they show positive assortativity by degree and type. By quantitatively looking at ego networks, mobile phone networks, and online social networks we now have a better understanding of some of their peculiar features and underlying mechanisms. Education, wellness, age and spatial proximity are regarded as critical elements in the formation of friendship bonds 33,34 . Among these, age is probably the most studied, possibly due to the fact that age-related data is actually available at different spatial scales 35 . While there is wide evidence that geographical factors alone cannot explain the structure of real-world spatial social networks 12,36,37 , the dependence of friendship on distance is widely assumed to follow an inverse power-law with exponent $\beta \in [0.5, 2]$ 12,13,36-41 , and this surprisingly holds even for online relationships 42 . In particular, $\beta < 1$ seems to work better for short range contacts (< 20 km) 13 and for urban networks 40 , in line with sociological studies 43 .
However, contrary to other real-world networks 44 [...] skewed and relatively long-tailed 12,37 , and it has been, at times, approximated by a power-law with a large (5-8) exponent 38,46 or by a lognormal distribution 13 . Within cities, population density affects the frequency of close-range contacts, but usually not the overall size of each person's network 41 . While geographical proximity and community structure appear to be related 37,40,41 , some authors argue that only small clusters (< 30 members) are geographically bounded 39 , whereas the large ones may span very large areas of a city 37 . Defining simple models that capture all of these features is not an easy task. Models designed to mimic the scale-free degree distribution emerging in many real networks, for instance, may fail to yield the expected clustering structure 47,48 . Exponential random graphs have been shown to overcome some of these limitations 49,50 . Stochastic Block Models (SBMs) 2,3 , on the other hand, have been specifically developed to reproduce networks with a community structure, a typical characteristic of real-world social networks, which follows from the presence of some kind of group-level homophily. In this type of network model, the nodes are partitioned into disjoint sets named blocks and the probability of an edge existing between two nodes depends on the blocks to which the two nodes belong. The SBM and its generalizations have gained popularity in recent decades, as they can be used to discover and understand the structure of a network, as well as for clustering purposes 51,52 .
Spatial network models are often obtained by embedding vertices into a metric space, and the induced constraints can determine some of the network properties 53 . Introducing a penalty on "long" edges, which mimics the cost of maintaining long-distance relationships, has an impact on clusters, path lengths, degree distributions, and more 54 . Recently, network instances having suitable features have been generated by means of the so-called random geometric models [55][56][57] , where the popularity and similarity of the nodes depend on their position in some latent metric space 58 . Embedding the vertices into a hyperbolic disk 55 has proved to be a way to obtain both high clustering and a heavy-tailed degree distribution.
In the present paper, we leverage the framework of entropy-based null-models for real complex networks, reviewed in Ref. 7 . Among the very first fundamental papers, the work of Park and Newman has a particular relevance 59 : based on Jaynes' derivation of Statistical Physics from Information Theory 60 , they proposed a general maximum entropy approach for the randomization of complex networks. Besides the extension to a different context, the main innovations of Ref. 59 are the introduction of local constraints, such as the degree sequence, and the interpretation of the general framework of Exponential Random Graphs (ERGs) in terms of maximum entropy models. This construction was later extended to the analysis of real networks 61,62 , tailoring the entropy-based model to the observed network. As a matter of fact, the various Lagrange multipliers, introduced for the entropy maximisation in Ref. 59 , can be numerically calculated by maximising the (Log-)Likelihood associated with the real network. Such a construction represents a perfect benchmark for the analysis of real systems, since it is maximally random (due to the entropy maximisation) and tailored to the observed system (due to the Likelihood maximisation). It is not surprising that it has been extensively applied to the study of non-trivial structural patterns of different systems, such as financial and trade networks, biological systems and online social networks 7,63 . Moreover, the general framework can be easily extended to tackle different kinds of networks, such as undirected, directed, weighted, directed and weighted [64][65][66] , bipartite 67 , bipartite weighted 68 and degree corrected block models 4 . Another relevant branch of research focuses on the reconstruction of networks from limited information 69 . This application is of particular interest for risk assessment of financial networks, where limited information is available due to privacy concerns 70 .
Reconstruction approaches based on entropy-based null models have proven particularly effective in this context, and the maximally random nature of the framework described here is critical to avoid the introduction of bias into the predictions 69-73 .

Methods
Formalism. Let $\mathcal{G}$ be the ensemble of all simple graphs of $N$ vertices. If $P$ is a probability distribution over $\mathcal{G}$, $P(G)$ is the probability of graph $G \in \mathcal{G}$, and $\langle \cdot \rangle_P$ denotes the expectation with respect to $P$. The vertex set $V = \{v_i\}_{i=0}^{N-1}$ is partitioned into $n$ blocks $\{B_I\}_{I=0}^{n-1}$. The size of block $I$ is $N_I = |B_I|$ and, for each $v_i \in V$, $I_i$ denotes the index of the block to which $v_i$ belongs. For all pairs $I, J$, $N_{IJ}$ denotes the number of possible pairs $(i, j)$ with $v_i \in B_I$, $v_j \in B_J$ and $i \neq j$, i.e., $N_{II} = N_I (N_I - 1)$ and $N_{IJ} = N_I N_J$ if $I \neq J$. Notice that in the case of $N_{II}$ we are counting each couple twice, as we will do for the counts of edges in the following.
Each $G \in \mathcal{G}$ is uniquely determined by its adjacency matrix, with entries $a_{ij}(G)$. The total degree of block $I$ is $\deg_I(G) = \sum_{i \in I} \deg_i(G)$ and, for all $I, J$, $L_{IJ}(G) = \sum_{i \in I} \sum_{j \in J} a_{ij}(G)$ is the number of edges between $B_I$ and $B_J$ in $G$, or, if $I = J$, twice that number. Therefore, using the definitions, $\deg_I(G) = \sum_J L_{IJ}(G)$. For the sake of simplicity, the dependence of these quantities on the specific graph $G$ will often be omitted in the following.
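The bookkeeping behind these definitions can be checked on a toy graph. The following Python sketch is only an illustration: the function name `block_edge_counts`, the edge list and the block assignment are made up for this example.

```python
def block_edge_counts(edges, block_of, n_blocks):
    """Return (deg, L) for an undirected simple graph.

    deg[i] is the degree of vertex i; L[I][J] is the number of edges
    between blocks I and J, counted twice when I == J.
    """
    deg = [0] * len(block_of)
    L = [[0] * n_blocks for _ in range(n_blocks)]
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
        # An undirected edge sets both a_ij and a_ji, so it adds one unit
        # to L[I][J] and one to L[J][I]; for I == J it is counted twice.
        L[block_of[i]][block_of[j]] += 1
        L[block_of[j]][block_of[i]] += 1
    return deg, L

# Toy graph: 5 vertices, blocks {0, 1, 2} and {3, 4} (illustrative values).
block_of = [0, 0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
deg, L = block_edge_counts(edges, block_of, n_blocks=2)

# Total degree of each block: deg_I = sum of deg_i over the vertices of I.
block_deg = [sum(d for d, b in zip(deg, block_of) if b == I) for I in range(2)]

# Density as the sum of all L_IJ divided by N (N - 1).
N = len(block_of)
density = sum(map(sum, L)) / (N * (N - 1))
```

With the double counting on the diagonal, each row sum of `L` equals the total degree of the corresponding block, which is the identity $\deg_I = \sum_J L_{IJ}$ stated above.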

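Maximum entropy Degree-Corrected Block Model 4 . The maximum entropy Degree-Corrected Block Model (DCBM) is defined as the maximum entropy probability distribution $P$ over $\mathcal{G}$ in which the number of links per block and the degree sequence are constrained on average, i.e.

$$\langle L_{IJ} \rangle_P = K_{IJ} \quad \text{for all } I, J \tag{7}$$

$$\langle \deg_i \rangle_P = k_i \quad \text{for all } i \tag{8}$$

with $\sum_J K_{IJ} = \sum_{i \in I} k_i$, for all $I$. If $H(P)$ denotes the Shannon entropy of $P$, the sought $P$ can be obtained by finding the stationary points of the functional that combines $H(P)$ with the constraints (7) and (8) and with the normalization of $P$, where $\eta_{IJ}$, $\theta_i$ and $\alpha$ are Lagrange multipliers: while $\eta_{IJ}$ and $\theta_i$ control the conditions (7) and (8), respectively, $\alpha$ is necessary for the normalization of the probability $P(G)$. Since the functional derivatives with respect to $P(G)$ are $\frac{d \langle L_{IJ} \rangle_P}{dP(G)} = \sum_{i \in I} \sum_{j \in J} a_{ij}(G)$ and $\frac{d \langle \deg_i \rangle_P}{dP(G)} = \sum_j a_{ij}(G)$, the stationarity condition gives $P(G)$ an exponential form in the constrained quantities, and the probability per graph factorises in terms of probabilities per link as

$$P(G) = \prod_{i<j} p_{ij}^{a_{ij}(G)} (1 - p_{ij})^{1 - a_{ij}(G)}, \qquad p_{ij} = \frac{x_i x_j y_{I_i J_j}}{1 + x_i x_j y_{I_i J_j}} \tag{10}$$

where $x_i = e^{-\theta_i}$ and $y_{IJ} = e^{-\eta_{IJ}}$. For all $i$ and all $I \le J$, $x_i$ and $y_{IJ}$ can be found by solving the system of equations

$$\sum_{i \in I} \sum_{j \in J,\, j \neq i} \frac{x_i x_j y_{IJ}}{1 + x_i x_j y_{IJ}} = K_{IJ} \quad \text{for all } I \le J \tag{11}$$

$$\sum_{j \neq i} \frac{x_i x_j y_{I_i J_j}}{1 + x_i x_j y_{I_i J_j}} = k_i \quad \text{for all } i \tag{12}$$

Sparse DCBM. When the average degree $\langle k \rangle$ is small, i.e., when the network is sparse, the edge probability of the DCBM can be approximated as $p_{ij} \approx x_i x_j y_{I_i J_j}$. This allows us to rewrite (11) and (12) as

$$y_{IJ} \sum_{i \in I} x_i \sum_{j \in J,\, j \neq i} x_j = K_{IJ} \quad \text{for all } I \le J \tag{13}$$

$$x_i \sum_J y_{I_i J} \sum_{j \in J,\, j \neq i} x_j = k_i \quad \text{for all } i \tag{14}$$

Since the constants $K$ and $k$ are, by construction, bound by the relation $\sum_J K_{IJ} = \deg_I = \sum_{i \in I} k_i$, (13) and (14) admit a closed-form solution.

Maximum entropy Fitness-Corrected Block Model. Given a scalar $p \in (0, 1)$, a symmetric matrix $\Lambda$ such that $\sum_{I,J} \Lambda_{IJ} = 2$, and a fitness sequence $f$, we define the Fitness-Corrected Block Model (FCBM) as the maximum entropy model fulfilling the following two conditions:

$$\langle L_{IJ} \rangle_P = p \binom{N}{2} \Lambda_{IJ} \quad \text{for all } I, J \tag{15}$$

$$\langle \deg_i \rangle_P = p \binom{N}{2} \frac{f_i}{\sum_{u \in I_i} f_u} \sum_J \Lambda_{I_i J} \quad \text{for all } i \tag{16}$$

The FCBM is a variation of the DCBM where the network density $p$ is a configuration parameter. As in the original stochastic block model, the block-level mixing is specified in terms of a set of edge-densities $\Lambda_{IJ}$, rather than a set of edge-counts $K_{IJ}$. Similarly, the degree sequence $k$ is replaced by a vertex intrinsic fitness sequence $f$, in line with previous models available in the literature 5,6 . $f_i$ measures $v_i$'s propensity to establish links and $\langle \deg_i \rangle_P$ is set proportional to $f_i$ by a constant that depends on $p$ and on $I_i$.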
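By design, (15) and (16) imply $\sum_J \langle L_{IJ} \rangle_P = \langle \deg_I \rangle_P = \sum_{i \in I} \langle \deg_i \rangle_P$, and (16) can be rewritten as

$$\langle \deg_i \rangle_P = \frac{f_i}{\sum_{u \in I_i} f_u} \langle \deg_{I_i} \rangle_P$$

which clarifies the role of $f$ as an element of intra-block heterogeneity.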
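The consistency between block-level and vertex-level targets can be verified numerically. The Python sketch below assumes that the expected block edge counts take the form $p \binom{N}{2} \Lambda_{IJ}$, with the entries of the mixing matrix summing to 2; this normalization, the function name `fcbm_targets` and all numeric values are assumptions of the sketch.

```python
def fcbm_targets(p, Lam, f, block_of):
    """Expected block edge counts K[I][J] and expected degrees k[i].

    Assumes <L_IJ> = p * C(N, 2) * Lam[I][J], and that <deg_i> is the
    share f_i / (total fitness of block I_i) of the expected total
    degree of block I_i.
    """
    N = len(f)
    n = len(Lam)
    pairs = N * (N - 1) / 2.0
    K = [[p * pairs * Lam[I][J] for J in range(n)] for I in range(n)]
    F = [0.0] * n  # total fitness of each block
    for fi, b in zip(f, block_of):
        F[b] += fi
    k = [fi / F[b] * sum(K[b]) for fi, b in zip(f, block_of)]
    return K, k

block_of = [0, 0, 0, 1, 1, 1, 1]            # illustrative block assignment
f = [1.0, 2.0, 3.0, 0.5, 0.5, 1.0, 2.0]     # illustrative fitness values
Lam = [[0.8, 0.3], [0.3, 0.6]]              # symmetric, entries sum to 2
K, k = fcbm_targets(0.2, Lam, f, block_of)

# For every block, the expected degrees add up to the row sum of K,
# whatever the values of p, Lam and f.
for I in range(len(Lam)):
    in_block = sum(ki for ki, b in zip(k, block_of) if b == I)
    print(I, in_block, sum(K[I]))
```

The check holds for any positive fitness vector because the fitness shares within a block sum to one, which is why the targets never need to be tuned by hand to be mutually consistent.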
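For fixed $\Lambda$ and $f$, the derivation of the maximum entropy FCBM is identical to that of the DCBM: the maximum entropy probability per graph factorizes into the probability per edge given by (10) and the vectors of constants $x$ and $y$ can be obtained by solving the analogue of (11) and (12), i.e.

$$\sum_{i \in I} \sum_{j \in J,\, j \neq i} \frac{x_i x_j y_{IJ}}{1 + x_i x_j y_{IJ}} = p \binom{N}{2} \Lambda_{IJ} \quad \text{for all } I \le J \tag{17}$$

$$\sum_{j \neq i} \frac{x_i x_j y_{I_i J_j}}{1 + x_i x_j y_{I_i J_j}} = p \binom{N}{2} \frac{f_i}{\sum_{u \in I_i} f_u} \sum_J \Lambda_{I_i J} \quad \text{for all } i \tag{18}$$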
Sparse FCBM. In the sparse regime, i.e., when $p \ll 1$, using $p_{ij} \approx x_i x_j y_{I_i J_j}$ yields a closed-form approximate solution to (17) and (18). In this regime, the probability per edge can thus be rewritten as

$$p_{ij} \approx p \binom{N}{2} \Lambda_{I_i J_j} \frac{f_i f_j}{\sum_{u \in I_i} f_u \sum_{w \in J_j} f_w}$$

Degree distribution for the sparse FCBM. Let us assume that each fitness value $f_i$ is drawn from a suitable distribution $p_f$. The sparse-regime approximation allows us to estimate the expected degree distribution of the sampled graph $G$ with respect to both $p_f$ and the graph sampling probability $P$. If $N_{I_i}$ is large enough, we have $\sum_{u \in I_i} f_u \approx \langle f \rangle_{p_f} N_{I_i}$. Now, following the approach used in Ref. 5 , the estimates (22) and (24) for the degree distribution are obtained.

Data-driven FCBM for spatial social networks. Our FCBM can be easily tuned to real data and empirical findings to produce instances of a maximum entropy spatial social network. On one hand, the local density and demographic profile of the population are generally available in the form of discrete, geographically located (e.g., residents in 500 m × 500 m tiles) and/or age-stratified (e.g., 0-5 years old) population segments. These data naturally induce a partition of the population into blocks. On the other hand, intra-block population heterogeneity can be controlled by a vertex-related social fitness, possibly modelled upon measurable features such as wealth, employment or mobility. Formally, let the vertex set $V$ describe an age-stratified population of $N$ individuals living in a territory tessellated into square tiles of side $l$. Each $v_i$ is characterized by two data-driven discrete attributes: its tile of residence $t_i \in T$, that is, the discretized position of $v_i$ in the territory, and its age-group $g_i \in \Gamma$. $t_i$ and $g_i$ may either be directly available, in the case of a real population, or be drawn, respectively, from a given spatial density $p_t$ and age-distribution $p_g$, in the case of a synthetic population.
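A sparse-regime edge probability of this kind can be evaluated directly from the configuration parameters, without solving any nonlinear system. The Python sketch below assumes the form $p_{ij} \approx p \binom{N}{2} \Lambda_{I_i J_j} f_i f_j / (F_{I_i} F_{J_j})$, where $F_I$ is the total fitness of block $I$; the normalization constant, the clipping at 1 and the function name are assumptions of this sketch, and the parameter values are illustrative.

```python
def sparse_fcbm_probabilities(p, Lam, f, block_of):
    """Matrix of approximate edge probabilities in the sparse regime.

    Assumed form: p_ij = p * C(N, 2) * Lam[I_i][I_j] * f_i * f_j
    divided by (F[I_i] * F[I_j]), with F the per-block fitness totals.
    Values are clipped at 1, which only matters outside the sparse regime.
    """
    N = len(f)
    pairs = N * (N - 1) / 2.0
    F = [0.0] * len(Lam)
    for fi, b in zip(f, block_of):
        F[b] += fi
    P = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j:  # simple graphs: no self-loops
                bi, bj = block_of[i], block_of[j]
                q = p * pairs * Lam[bi][bj] * f[i] * f[j] / (F[bi] * F[bj])
                P[i][j] = min(1.0, q)
    return P

# Two blocks of 100 vertices each, fitness values cycling through 1..5.
block_of = [0] * 100 + [1] * 100
f = [1.0 + (i % 5) for i in range(200)]
Lam = [[0.8, 0.3], [0.3, 0.6]]   # symmetric, entries sum to 2
p = 0.05
P = sparse_fcbm_probabilities(p, Lam, f, block_of)

# Row sums approximate the expected degrees; the only discrepancy from
# the fitness-proportional target is the excluded self-pair term.
expected_degree = [sum(row) for row in P]
```

In this example the row sums deviate from the fitness-proportional targets by about one percent at most, and the deviation shrinks as the blocks grow.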
These two attributes induce a partition of the population into $n = |T| \cdot |\Gamma|$ blocks $\{B_I\}_{I=0}^{n-1}$, with $v_i \in B_I = (t_I, g_I)$ if and only if $t_i = t_I$ and $g_i = g_I$. We embrace the widely acknowledged assumption that an inverse-power-law relation binds the distance $d_{IJ}$ and the frequency of social relations between individuals living in $t_I$ and $t_J$ 12,13 . For all pairs of blocks $I, J$, we thus define the edge-density $\Lambda_{IJ}$ to be proportional to $s_{IJ}$ and to $d_{IJ}^{-\beta}$, where $d_{IJ}$ is the normalized (geographic or euclidean) distance between tiles $t_I$ and $t_J$; the normalization is obtained through a division by $l/2$. We set $d_{II} = 1$, so that the distance between individuals in the same tile is half the distance between individuals living in neighboring tiles. $\beta > 0$ is a configuration parameter. $s_{IJ}$ measures the tendency of age groups $g_I$ and $g_J$ to socialize with each other; such a $|\Gamma| \times |\Gamma|$ symmetric age-based social mixing matrix $S$ can be obtained, by imposing reciprocity and normalizing, from a suitable data-driven contact matrix 9 .
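A block-wise mixing matrix with these properties can be assembled as in the following Python sketch. It assumes plain proportionality to $s_{IJ} d_{IJ}^{-\beta}$ followed by a rescaling that makes the entries sum to 2; whether further factors (for instance block sizes) enter the definition is not settled by this sketch. Tiles are identified by hypothetical integer grid coordinates, and the matrix `S` holds made-up values.

```python
import math

def mixing_matrix(tiles, ages, S, side, beta):
    """Block mixing matrix proportional to s_IJ * d_IJ ** (-beta).

    Blocks are (tile, age group) pairs; tiles are (col, row) grid indices.
    Distances between tile centres are divided by side / 2, and the
    distance of a tile from itself is set to 1.
    """
    blocks = [(t, g) for t in tiles for g in ages]
    n = len(blocks)
    Lam = [[0.0] * n for _ in range(n)]
    for I, (tI, gI) in enumerate(blocks):
        for J, (tJ, gJ) in enumerate(blocks):
            if tI == tJ:
                d = 1.0
            else:
                d = math.dist(tI, tJ) * side / (side / 2.0)
            Lam[I][J] = S[gI][gJ] * d ** (-beta)
    total = sum(map(sum, Lam))
    # Rescale so that the entries sum to 2, as required of the mixing matrix.
    return blocks, [[2.0 * v / total for v in row] for row in Lam]

tiles = [(0, 0), (1, 0)]            # two adjacent tiles (illustrative)
ages = [0, 1]                       # two age groups (illustrative)
S = [[1.0, 0.5], [0.5, 0.8]]        # made-up symmetric age mixing matrix
blocks, Lam = mixing_matrix(tiles, ages, S, side=0.5, beta=1.0)
```

With $\beta = 1$, blocks in adjacent tiles receive half the weight of blocks in the same tile for the same pair of age groups, since their normalized distance is 2.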
Finally, we extract the fitness vector $f$, set the density parameter $p \in (0, 1)$ and, for all pairs $i, j$, compute the edge probability $p_{ij}$ as described for the FCBM. As a result, the graph sampling probability $P$ guarantees that: (1) the expected number of links between $B_I$ and $B_J$ is proportional to $s_{IJ}$ and decays as $d_{IJ}^{-\beta}$; (2) the expected degree of $v_i$ is proportional to $f_i$ and to the expected total degree of block $I_i$.
In this case, the sparse-regime approximation yields expression (25), which defines a phenomenological model where the probability of two individuals being connected is proportional to their sociability and to the cohesion of their age-groups, while decaying as a power of their distance. Clearly, the estimates obtained in (22) and (24) for the degree distribution remain valid in the data-driven model. If the network is sufficiently sparse and the population of all tiles/groups is sufficiently large, the degree distribution of the sampled graph $G$ is controlled by the available data and by the chosen fitness distribution $p_f$.
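The way a fitness distribution shapes the degree distribution can be illustrated with a short Python sketch. It assumes that, in the sparse regime, the expected degree of a vertex in block $I$ is approximately $f_i \mu_I / \langle f \rangle$, so that a change of variables gives a rescaled copy of $p_f$ per block; this change-of-variables form is an assumption of the sketch and is not a transcription of (22) and (24). All function names and numeric values are illustrative.

```python
import math

def degree_pdf_blockwise(k, fitness_pdf, mean_f, mu_blocks, weights):
    """Mixture over blocks of the rescaled fitness density (assumed form).

    mu_blocks[I] is the expected average degree in block I and
    weights[I] is the fraction N_I / N of vertices in block I.
    """
    return sum(w * (mean_f / mu) * fitness_pdf(k * mean_f / mu)
               for mu, w in zip(mu_blocks, weights))

def degree_pdf_global(k, fitness_pdf, mean_f, mu):
    """Single-scale variant that only uses the network average degree mu."""
    return (mean_f / mu) * fitness_pdf(k * mean_f / mu)

def exp_pdf(value):
    """Exponential fitness density with mean 1."""
    return math.exp(-value) if value >= 0 else 0.0

# Two equally sized blocks with average degrees 15 and 35 (made-up values),
# compared with the single-scale variant at the overall average degree 25.
for k in (5.0, 25.0, 60.0):
    blockwise = degree_pdf_blockwise(k, exp_pdf, 1.0, [15.0, 35.0], [0.5, 0.5])
    overall = degree_pdf_global(k, exp_pdf, 1.0, 25.0)
    print(k, blockwise, overall)
```

Under this assumption an exponential fitness yields an exponential-like degree density, and the block-wise and single-scale variants coincide whenever all blocks share the same average degree.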

Data availability
All code and data used in this paper are available at the following public repositories: https://gitlab.com/cranic-group/usn and https://gitlab.com/cranic-group/dcbm_solver.